Inference of Isoforms from Short Sequence Reads

نویسندگان

  • Jianxing Feng
  • Wei Li
  • Tao Jiang
چکیده

Due to alternative splicing events in eukaryotic species, the identification of mRNA isoforms (or splicing variants) is a difficult problem. Traditional experimental methods for this purpose are time consuming and cost ineffective. The emerging RNA-Seq technology provides a possible effective method to address this problem. Although the advantages of RNA-Seq over traditional methods in transcriptome analysis have been confirmed by many studies, the inference of isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) has remained computationally challenging. In this work, we propose a method to calculate the expression levels of isoforms and infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS) and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can calculate the expression levels of isoforms accurately if all the isoforms are known and infer novel isoforms from scratch. Our experimental tests on known mouse isoforms with both simulated expression levels and reads demonstrate that IsoInfer is able to calculate the expression levels of isoforms with an accuracy comparable to the state-of-the-art statistical method and a 60 times faster speed. Moreover, our tests on both simulated and real reads show that it achieves a good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS, and PAS information, especially for isoforms whose expression levels are significantly high. The software is publicly available for free at http://www.cs.ucr.edu/∼jianxing/IsoInfer.html.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Inference of Isoforms from Short Sequence

Due to alternative splicing events in eukaryotic species, the identification of mRNA isoforms (or splicing variants) is a difficult problem. Traditional experimental methods for this purpose are time consuming and cost ineffective. The emerging RNA-Seq technology provides a possible effective method to address this problem. Although the advantages of RNA-Seq over traditional methods in transcri...

متن کامل

RNA-Seq gene expression estimation with read mapping uncertainty

MOTIVATION RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analy...

متن کامل

A Robust Method for Transcript Quantification with RNA-seq Data

The advent of high throughput RNA-seq technology allows deep sampling of the transcriptome, making it possible to characterize both the diversity and the abundance of transcript isoforms. Accurate abundance estimation or transcript quantification of isoforms is critical for downstream differential analysis (e.g., healthy vs. diseased cells) but remains a challenging problem for several reasons....

متن کامل

Accurate Identification and Analysis of Human mRNA Isoforms Using Deep Long Read Sequencing

Precise identification of RNA-coding regions and transcriptomes of eukaryotes is a significant problem in biology. Currently, eukaryote transcriptomes are analyzed using deep short-read sequencing experiments of complementary DNAs. The resulting short-reads are then aligned against a genome and annotated junctions to infer biological meaning. Here we use long-read complementary DNA datasets for...

متن کامل

Haplotype Inference from Single Short Sequence Reads Using a Population Genealogical History Model

High-throughput sequencing is currently a major transforming technology in biology. In this paper, we study a population genomics problem motivated by the newly available short reads data from high-throughput sequencing. In this problem, we are given short reads collected from individuals in a population. The objective is to infer haplotypes with the given reads. We first formulate the computat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Journal of computational biology : a journal of computational molecular cell biology

دوره 18 3  شماره 

صفحات  -

تاریخ انتشار 2010